Development of bespoke machine learning and biocuration workflows in a BioC-supporting text mining workbench
Authors
Abstract
As part of our participation in the Collaborative Biocurator Assistant Task of BioCreative V, we developed methods and tools for recognising and normalising mentions denoting genes/proteins and organisms. A combination of different approaches was used in addressing these tasks. The recognition of gene/protein and organism names was cast as a sequence labelling problem, to which the conditional random fields algorithm was applied. In training our models, various lexical and orthographic features were extracted from the CHEMDNER GPRO and S800 corpora, which served as gold standard data. Our feature set was further enriched with semantic attributes drawn from matches between mentions in text and entries in relevant dictionaries. In normalising recognised names, i.e., assigning resource identifiers to them, the Jaro-Winkler and Levenshtein distance measures were used to estimate string similarity and rank candidate matches. Integration of the various techniques and resources was facilitated by the Web-based Argo text mining workbench, which allows for the straightforward construction of automatic text processing workflows. Upon using our training workflow to produce gene/protein and organism name recognition models and subsequently evaluating them, micro-averaged F-scores of 70% and 72.87% were obtained, respectively. Curation workflows applying our models to the provided BioC corpus of 120 full-text PubMed Central documents generated normalised named entity annotations, which were serialised in the required BioC format.
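The normalisation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the Jaro-Winkler and Levenshtein measures were combined, so averaging the two scores is an assumption made here, as are the lower-casing of strings, the function names, and the toy dictionary with its made-up identifiers.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the standard two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def jaro(a: str, b: str) -> float:
    """Jaro similarity in [0, 1]."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, scaling: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    base = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


def rank_candidates(mention: str, dictionary: dict, top_k: int = 3) -> list:
    """Rank dictionary entries against a mention, best match first.

    Assumption: the final score is the mean of Jaro-Winkler similarity and
    a length-normalised Levenshtein similarity; the source does not specify this.
    """
    scored = []
    for identifier, name in dictionary.items():
        m, n = mention.lower(), name.lower()
        lev_sim = 1 - levenshtein(m, n) / max(len(m), len(n), 1)
        scored.append((identifier, name, (jaro_winkler(m, n) + lev_sim) / 2))
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:top_k]


# Hypothetical dictionary; the identifiers are placeholders, not real resource IDs.
TOY_DICTIONARY = {
    "TOY:001": "tumor protein p53",
    "TOY:002": "BRCA1",
    "TOY:003": "interleukin 6",
}

best_id, best_name, best_score = rank_candidates("Brca-1", TOY_DICTIONARY)[0]
```

With this toy dictionary, the mention "Brca-1" ranks the "BRCA1" entry first, since it differs by a single character and shares a long prefix. A real system would also need a score threshold below which no identifier is assigned.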
Similar resources
Text-mining-assisted biocuration workflows in Argo
Biocuration activities have been broadly categorized into the selection of relevant documents, the annotation of biological concepts of interest and the identification of interactions between the concepts. Text mining has been shown to have the potential to significantly reduce the effort of biocurators in all three activities, and various semi-automatic methodologies have been integrated into cu...
Argo: enabling the development of bespoke workflows and services for disease annotation
Argo (http://argo.nactem.ac.uk) is a generic text mining workbench that can cater to a variety of use cases, including the semi-automatic annotation of literature. It enables its technical users to build their own customised text mining solutions by providing a wide array of interoperable and configurable elementary components that can be seamlessly integrated into processing workflows. With Ar...
POSBIOTM/W: A Development Workbench for Machine Learning Oriented Biomedical Text Mining System
POSBIOTM/W is a workbench for a machine-learning-oriented biomedical text mining system. It is intended to assist biologists in mining useful information efficiently from biomedical text resources. To do so, it provides a suite of tools for gathering, managing, analyzing and annotating texts. The workbench is implemented in Java, which means that it is platform-independent.
Text mining for the biocuration workflow
Molecular biology has become heavily dependent on biological knowledge encoded in expert curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-) automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documen...
Building Trainable Taggers in a Web-based, UIMA-Supported NLP Workbench
Argo is a web-based NLP and text mining workbench with a convenient graphical user interface for designing and executing processing workflows of various complexity. The workbench is intended for specialists and nontechnical audiences alike, and provides an ever-expanding library of analytics compliant with the Unstructured Information Management Architecture, a widely adopted interoperability ...
Journal title:
Volume, issue:
Pages: -
Publication date: 2015